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Summary. Hierarchical models allow for heterogeneous behaviours in a population while simulta¬ 
neously borrowing estimation strength across all subpopulations. Unfortunately, existing likelihood- 
based methods for fitting hierarchical models have high computational demands, and these de¬ 
mands have limited their adoption in large-scale prediction and inference problems. This paper 
proposes a moment-based procedure for estimating the parameters of a hierarchical model which 
has its roots in a method originally introduced by Cochran in 1937. The method trades statistical 
efficiency for computational efficiency. It gives consistent parameter estimates, competitive predic¬ 
tion error performance, and substantial computational improvements. When applied to a large-scale 
recommender system application and compared to a standard maximum likelihood procedure, the 
method delivers competitive prediction performance while reducing the sequential computation time 
from hours to minutes. 

Keywords: Hierarchical model; Generalized linear mixed model; Recommender systems; 
Statistical-computational trade-off 


1. Introduction 


Hierarchical models are appropriate when we collect data from multiple sub-populations or 
groups, each of which exhibits different associations between the measured variables. Each 
group can be a particular classroom, firm, city, time period, or any member of a class of similar 
entities. Rather than ignoring the subpopulation structure and assuming that all observations 
are independent, a hierarchical model accounts for the dependence of the observations within 
a group by allowing for random subpopulation-specific effects. These models and more general 
mixed models are widely applied in the natural and social sciences, and many reference books 


describe them in detail (Snijders and Bosker, 2012 Scott et al. 2013). 


By explicitly allowing for between-group variability, hierarchical models hold two main ad¬ 
vantages over models that do not. First, in accounting for this variability, a hierarchical model 


is able to give more accurate uncertainty estimates for population parameter estimates (Rao 


1965). Second, by drawing strength across similar experimental units, a hierarchical model can 


give better group-specific predictions (Reinsel 1985). The latter phenomenon is closely related 


to the performance of Stein’s shrinkage estimators (Morris 1983). 


One seemingly-appropriate application for hierarchical models is in recommender systems, 
where the goal is to take historical data about users, items, and user ratings of these items to 


learn users’ preferences and to make recommendations based on these preferences (Adomavicius 


and Tuzhilin 2005). Here, users correspond to groups, and user-specific preferences correspond to 


random effects. In fact, early in the development of recommender systems, Condliff et al. (19991 
and Ansari et al. (20001 advocated for the use of these models and more general mixed models 
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due to their potential to combine content-based filtering (recommending based on item-specific 
attributes) and collaborative filtering (recommending based on preferences of similar users). 

Despite their advantages, in the late 2000s, many authors deemed the computational costs 
required to fit a hierarchical model to be prohibitively high for recommender systems and other 


similar applications in commercial-scale settings (Zhang and Koren 


2007 

Agarwal 

to 

o 

o 

00 

Naik 


et al. 2008 Agarwal and Chen 2009). Most methods for fitting these models and related factor 
models are iterative, with a high computational cost for each iteration. Letting q denote the 
number of fixed and random effects in the model, methods based on expectation-maximization 


(Dempster et al. 

1981] 

Zhang and Agarwal 

2009 Agarwal and Chen 

2009) 

variational approx- 

imations (Armagan and Dunson 

2011), likelihood maximization ( 

Goldstein 

1986 

Jennrich and 

Schluchter 

1986 

Longford 

1987 

Lindstrom and Bates 1988), and profile likelihood maximiza- 


followed by a series of iterations, each with computational costs proportional Mq 3 or Mq 4 , where 
M is the number of groups. This can be substantial when M and N are both large. 

In cases where the predictors are sparse, it is possible to exploi t this str ucture to achieve speed- 
ups on the order of q or q 2 , which can be dramatic if q is large (Zhang and Koren 2007). This, 


however, requires special structure in the predictor matrices and imposes sparsity constraints on 
the parameter estimates. 

In general situations, one can partition the data between multiple processors, compute sepa¬ 


rate parameter estimates for each chunk, and then combine the results (Huang and Gelman 2005 


Gebregziabher et al. 

2012 

Khanna et al. 

2013 

Scott et al., 

2013 


These splitting strategies 
often require the same total computational cost, but they split the costs between K processors, 
reducing wall clock time by a factor of K. An alternative approach is to approximate the data 
likelihood using a form of h-likelihood and then optimize the resulting criterion via stochastic 


gradient descent (Koren et al. 2009 Dror et al. 20111. This requires a series of iterations, each 


with computation costs proportional to Nq, often leading to a lower overall fitting time. 

In this report, we propose an alternative approach, revisiting and extending a moment-based 


estimation procedure originally due to Cochran (1937). In this approach, we fit group-specific 


estimates in isolation, then combine these estimates to get population parameter estimates by 
matching moments. The main advantage of the approach over existing alternatives is that it is 
not iterative. There is an initial cost proportional to Nq 2 , followed by a fixed cost proportional 
to Mq 4 . Due to memory locality, in practice the dominant cost is often proportional to M. The 
procedure can be trivially distributed across K processors, reducing computation by a factor 
of K. 

Kg. a demonstrates the potential advantages of the moment-based estimation method. This 
figure shows the amount of CPU time required by three different procedures—maximum like¬ 
lihood ( glmer ), stochastic gradient descent ( sgd ), and the proposed method ( mhglm )- fitting 
hierarchical models to subsets of the MovieLens 10M recommender system dataset (GroupLens 


2009). The first two methods are implemented in a mix of R, C, and C++; the proposed method 


is implemented in R. In this example, the computational costs required for the first two methods 
appear to scale linearly with the sample size, IV, while for the latter, the dominant computational 
costs appear to be proportional to M. At the largest value of N reported, the proposed method 
is 50 times faster than glmer , and 1.7 times faster than sgd (90 times faster if we include the 
cross-validation time required to choose the tuning parameter for sgd). Notably, even if glmer 
were split across 10 processors, running the proposed method on a single CPU would still be 
faster by a factor of 4. 

In this report, we demonstrate that the proposed moment-based estimation procedure is 
often faster than likelihood-based methods. The improvements in computational efficiency do 
not come free; they are paid for by sacrificing some statistical efficiency. In many large-sample 
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Fig. 1 . Computational scaling properties for hierarchical model fitting procedures. 


regimes, the loss in statistical efficiency is small or modest, and it becomes worthwhile to make 
this statistical-computational trade-off. 

We introduce hierarchical models in more detail in Section [2] Next, in Section [3] we describe 
the proposed moment-based fitting procedure. This procedure depends on a choice of weights, 
which we discuss in Section [4] In Sections [5] and [6] we derive finite-sample and asymptotic 
properties for the estimators, including consistency, relative efficiency, and asymptotic normality. 
We investigate performance in simulations in Section [7] Finally, we apply the method to a 
recommender system application in Section [8] and close with a brief discussion in Section [9j The 
on-line supplementary material contains Appendices [A] [F] with additional details and technical 
lemmas. 

The proposed method is implemented in the mbest R package, available at http://cran. 
r-project.org/web/packages/mbest/. Data and software to generate the figures in this paper 
are available at http://ptrckprry.com/reports/. 

2. Hierarchical models 

Consider a collection of M subpopulations or groups. In group i we observe rii random response 
values denoted individually as y l7 (j = 1,... ,rii), or jointly as the vector y i with j th component 
equal to yij for j = 1,... ,ni. The total number of observations is N = n i- Suppose that 

each observation has two associated predictor vectors: a vector Xjj of dimension p, and a 
vector Zij of dimension q. In matrix form, let Xi and Z t be the corresponding predictor matrices 
of dimensions rii x p and rii x q , with row j equal to Xij or z, 3 , respectively, for j = 1,... ,rij. 
Our goal will be to use the N observations to estimate the association between the response yij 
and the feature vectors Xi 7 and z^. 

In a hierarchical linear model, we posit that conditional on a vector Uj of group-specific 
random effects, the expectation of the response vector is determined by the relation 

E(y i | u.i) = Xi(3 + Z t Ui, (1) 

where f3 is a vector of p fixed population effects shared across all M groups. Further, we assume 
that within each group the response values are independent, with conditional variances given by 
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| ui) = cr 2 . Lastly, we take the random effect vectors u\,... ,Um to be independent and 
identically distributed with mean zero and covariance matrix co v(ui) = X for some positive- 
semidefinite matrix X. 

Hierarchical generalized linear models are natural extensions of hierarchical linear models 


that allow for non-linear relations between the response and the effects (|Lee and Nelder 
The set-up is similar to that for a hierarchical linear model 


i 


1996). 


with 
Instead 


but we replace the relation 

the nonlinear relation E{y i \ ui) = g~ 1 (Xi/3 + Ziui) for some specified link function gi. 
of a variance parameter cr 2 , we have a dispersion parameter <f> (possibly known). 

For a hierarchical linear model or hierarchical generalized linear model, given observations 
y = (y 1; ..., y M ) our main inferential task is estimating the population parameters (3, X, and cr 2 . 
Once these estimates have been obtained, they can be used together with the data to estimate 
(formally, predict) the random effect vectors iti,..., %, typically using a Gaussian approxima¬ 
tion to the conditional distribution iq | y t with plug-in estimates for quantities involving /3, X, 
and (j>. In turn, the estimated effect vectors can be used to forecast future response values. 

Our primary focus in this report is developing a computationally efficient method for esti¬ 
mating (3 , X, and cr 2 . We focus on applications where the number of groups, M, is large, with 
a small or moderate number of predictors (p + q -C M). 


3. Moment-based estimation 

3.1. Overview 

Before likelihood-based fitting procedures for hierarchical models became ubiquitous, Cochran 
developed a moment-based approach for fitting a univariate (p = q = 1) hierarchical linear model 

1954|). The method takes group-specific 


(Cochran 1937 Yates and Cochran, 1938 Cochran 


estimates of the effects and then uses weighted moments of these estimates to approximate 


the population parameters. Swamy (1970) extended Cochran’s method to multivariate settings, 


and Cox and Solomon (2002) further extended it to allow for hierarchical nonlinear models. 


The main advantage of these moment-based estimation methods is that they are not iterative. 
For these methods, and for the extension we introduce, there is a computational cost of roughly 
0{N(p+q) 2 } to fit the initial group-specific estimates, followed by a cost of 0{M (p+q) 3 + Mq 4 } 
to combine them. Furthermore, most of the operations are embarrassingly parallel, in the sense 
that it is trivial to split them across multiple processors. 

Moment-based estimation methods for hierarchical models are simple and computationally ef¬ 
ficient. Unfortunately, existing moment-based approaches require that X t = Z, t for i = 1,..., M. 
Moreover, they require each predictor matrix Xi to have full rank. These restrictions seem in¬ 
nocuous, but they become prohibitive in many large scale estimation problems, including the 
recommender system application discussed in Section [8] This motivates us to introduce an al¬ 
ternative extension of Cochran’s method, similar in spirit to Swamy’s procedure, but allowing 
for arbitrary fixed effects and removing most restrictions on the ranks of the predictor matrices. 


3.2. Intuition from the hierarchical linear model 

To gain an intuition into our procedure, we start by considering the hierarchical linear model. 
For i = 1,..., M define feature matrix F r = [X, Zj\ of size n* x (p + q) and effect vector 
ri i = [/3 T uf] T of dimension (p + q). The first p components of are shared across all M groups, 
and the last q components are random and specific to group i. The group-specific response vector 
can be expressed as 

Vi = Fihi + £ i> 

where £, has mean zero and is independent of Ui. 
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Define the least squares estimate 


fj t = (FjF^Fjy,, 

where I denotes Moore-Penrose pseudo-inverse. Previous approaches required F t to have full 
column rank, but we make no such restriction. Notably, without this restriction it will not 
generally be the case that E(fj i | it;) = r] i . Rank degeneracy leads to aliasing in the coefficients, 
which precludes unbiased estimation. 

Despite potential aliasing, the estimate f] i still contains information about the effects in the 
subspace spanned by the rows of Fi. Specifically, let 

/ = UiDiVj 


be a compact singular value decomposition, where Di >- 0 is diagonal with dimension r 7 ; x r, and 
UjUi = VjVi = I Ti . Let V n and V *2 (dimensions p x n and q x r,) contain the first p and 
last q rows of Vi, respectively, so that 


Xi = UiDiVl i, Z, = UiDiVj, 

with VJiVii + V^V i2 = I ri . Then, 

E(Vjfj t \u t ) = Vj 1 f3 + Vj 2 u i , (2a) 

cov(Vjfj i | m) = <j)DY 2 , (2b) 

where <j> = o 2 = var (£ij). Hence, the unconditional expectation and covariance of the effect 
components orthogonal to the nullspace of Fi are 

Eiyjfji) = Vl(3, 
cov(Vjf) l ) = V^V l2 + ^D~ 2 . 

In Section [373] we show how to use these moment relations to estimate the model parameters. 
For the dispersion parameter, we will use the unbiased estimator 


1 


= <j = 


M 

N-p ^ 

r i=1 


- F 


^ II2 


where ||-|| denotes Euclidean norm and p = X]f=i r %- -^ s l° n g as rii > r.i for at least one group i, 
this estimator is well-defined. 


3.3. The general procedure 

We define the general estimation procedure without reference to the response, the predictor 
matrices, or the specific data-generating mechanism. As a starting point, we will suppose that 
we have the following: 

(a) random effects u\,..., Um that are independent with mean zero and covariance matrix E; 

(b) group specific effect estimates r ) 1 ,..., r) M that satisfy the conditional moment relations ([2]); 

(c) matrices Di and V,; = [V, 1 -, VJ 2 ] T (i = 1, • • ■, M), where V» has r, orthonormal columns, 
and Di is a symmetric positive-definite matrix (not necessarily diagonal); 

(d) dispersion estimate <f> that has expectation <j). 
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The procedure depends on a choice of symmetric positive-definite weight matrices, denoted 
where W t has dimension r, x r*. We will discuss choices for the weights in 
Section [4] but for now, take them to be arbitrary. 

We will use the weights to combine the group-specific estimates into an estimate for the fixed 
effect j3. To do so, define 

M 

n = Y J V ll w t vj 1 . ( 3 ) 

2=1 

If ft is invertible, then we can define a moment-based estimator for /3: 

M 

$ w = n- l Y J V il w l vjf), l . (4) 

i=l 

By construction, 0 W is an unbiased estimator for /3. 

To introduce an estimator for the random effect covariance matrix S, first define the matrix¬ 
valued function 


M 

Mb) = E V * w i(Y?Vi - V hb){Vjf,r 
2=1 


vibyWiV 


T 

i2 ’ 


Set 

M 

n 2 = E V ‘2 W ‘VI 2 ® V a WiVj 2 , (5) 

1=1 

where ® denotes Kronecker product (with the notational convention that ® has lower precedence 
than matrix multiplication). When f2 2 is invertible on the subspace corresponding to symmetric 
matrices, define symmetric q x q matrix-valued function S(b) and symmetric q x q matrix B via 
the relation 


vec{S'(6)} = fi 2 1 vec{A(6)} 


M 


vec 


(B) = Cl; 1 vec { E V a WiD7 2 WjVj 2 }, 


i =1 


where vec(-) denotes column vector concatenation. For all matrices B , C, and X of consistent 
dimensions, vec (BXC) = ( C T ® B)vec(X). It follows that 

E{S{J3)} = S + <t>B. 

In light of this relation, define moment-based covariance matrix estimator "Sw as 


Sw — S{0 W ) — (j>B. 


( 6 ) 


Due to the dependence between j3 w and fj.^ the matrix Svv is n °t an unbiased estimate of S, 
but we will later show that its bias is often negligible. 

In practice, the estimate may not be positive semidefinite. To handle this situation, we 
can replace by the projection of Svv onto the cone of positive semidefinite matrices. 


Carter and Yang (1986) employ a similar modification. For any continuous function, g , if the 
convergence Svv A S holds, then g(T,w) A g(S). Thus, since projection onto the cone of 
positive semidefinite matrices is a continuous function, by the continuous mapping theorem, if 
Svv is a consistent estimator of S, then is as well. 
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The estimator (3 W as defined here is similar to the estimator used by Swamy (1970) and the 
other authors mentioned in Section 3.1 but, unlike the existing approaches, the form in (4]) allows 
for rank-degenerate predictor matrices. The estimator Sw is unique; earlier approaches used a 
simple unweighted covariance estimate, which requires full-rank predictor matrices to guarantee 
consistency. 


3.4. Application to hierarchical generalized linear models 

For a hierarchical generalized linear model, we will require subpopulation-specific effect estima¬ 
tors 7) i for r) i = [ f3 T uJY (i = 1 ,..., M) and a dispersion estimator cj). With these, we will apply 
the moment-based estimation procedure described in the previous section to get estimators for 
(3 and E. 

For most nonlinear models, the moment relations Q will not hold exactly. These relations 
will be approximations, with the quality of the approximation depending on the relative sizes of n,i 
and p + q. When using the moment-based procedure to estimate the parameters of a hierarchical 
generalized linear model, the estimators (3 W and will be biased, and we will not be able to 
get theoretical performance guarantees. However, as we later demonstrate in Sections [7] and [8] 
in many large-sample regimes, the moment relations Q are reasonable approximations, and the 
moment-based estimators perform well. 

As in the linear case, some of the group-specific feature matrices F, = [X, Z,\ (i = 1,..., M) 
may be rank-degenerate. We can handle these degeneracies by imposing linear identifiability con¬ 
straints on the group-specific estimates. Specifically, letting Vi be a matrix with ri orthonormal 
columns spanning the row space of Fi , we will require that r), lie in the span of Vi. With this 
constraint, under standard regularity conditions, if the maximum likelihood estimator exists then 

_ -I /q 

it will be unique, with conditional expectation E(Vjfji \ Ui) = Vjr)i +o(n i ' ) and conditional 
covariance cov(V'Jf) i | m) = 4>VJ(FiVi + o{r\ ) for a matrix A* depending on (3 and u t . 
We will use a plug-in estimate for A,, which will lead to a consistent estimate for cov(Vjffi \ Ui) 
as 7 ii increases. 


Unfortunately, even with the rank-degeneracy issue solved, the group-specific maximum like¬ 
lihood effect estimator may not exist for all i. In logistic regression models, this happens when 
the outcomes are perfectly separated by a linear combination of the predictors. One popular 



In light of these properties, we take to be Firth’s modified estimator instead of the maximum 
likelihood estimator. 

For </>, we will use a weighted combination of group-specific dispersion estimates rj> i,..., 4>m- 
With the usual Pearson residual-based dispersion estimate, <j>i will be approximately distributed 
as a chi-squared random variable with (n^ — r^) degrees of freedom, scaled by (f>/{rii — rj). 

The full procedure for estimating the parameters of a hierarchical generalized linear model is 
as follows: 

(a) For each group i = 1 ,..., M: 

(i) Construct group-specific feature matrix Fi = [X, Zj\\ use a singular value decompo¬ 
sition to decompose this matrix as Fi = FoiVf, where Foi has full column rank 
and Vi = [Vft VJ 2 ] T is a matrix of dimension ( p+q ) x r, with orthonormal columns. 

(ii) Use Firth’s modified score function with data (y^F oi) to get group-specific effect 
estimate fj oi . 
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(iii) Set D 2 to be a plug-in estimate of the unsealed conditional precision matrix of fj oi ; 
that is, set D~ 2 to be a plug-in estimate of (j )~ 1 co v(fj 0i \ Ui). 

(iv) Set f)i = Vifj 0i . 

(v) If <f> is unknown, compute group-specific dispersion estimate <f>i. 


(b) If <f) is unknown, compute pooled dispersion estimate 


E"i (n*-rO ’ 


otherwise, set <fi = <fi- 

(c) Choose positive-definite weight matrices W ..., Wm- With these weights, use Q and (j 6 j 
to compute estimates /3 = 0 W and X = Sw 

(d) Check if S is positive semidefinite. If not replace S with a projection onto the positive 
semidefinite cone. 

(e) Optionally, use the esitmates $ and X to choose a new set of weight matrices and redo 
steps Q and ©• 

(f) If required, use normal approximations for the distributions of it,; and 17 , | it; to compute 
empirical Bayes posterior mean and covariance estimates for up. 

E(u l \y) = C l V l2 (Vjfj l -Vj 1 $), 
cav(u, | y)=4>Cu 

where Ci = X ^ (^Jg + X ^ VJ 2 D 2 V ^ ) _1 X ^ . These quantities exist even if S does 
not have full rank. 


If we assume that at most a constant number of iterations are required in step (ii), then the 
computational complexity for fitting the itli group in step (|a| is of order 0(riii' 2 ), so that the 
total cost of step (Ja| is of order 0{N(p + q) 2 }. Step (|b| has cost O(M). For all choices of weight 
matrices discussed in this report, computing W, requires at most 0{r i g(r i + g) 2 -l-rf} operations, 
so that computing all M weight matrices has cost 0{M{p + g) 3 }. Once the weights have been 
computed, it takes 0{Mp(p + q) 2 } operations to compute $ w , followed by 0(Mq(p + q) 2 ) 
to compute A(/3 W ) and 0(Mpq 2 + Mq 4 ) to compute These are the dominant consts. 
Conservatively, step |c]) requires 0{M(p + q) 3 +Mq 4 } operations. Step (Jd]) has cost 0(q 3 ). The 
costs for the remaning steps are similar to those already discussed. 

In total, at most 0{N{p + q) 2 + M(p + q) 3 + Mq 3 } operations are required. This bound uses 
the approximation r; = 0(p + q), which is often conservative. In fact, in situations where the 
column space of Zi is contained in the column space of Xi for all i, we will have rg < p. In this 
scenario, at most 0{Np 2 + Mp 3 + Mq 4 ) operations are required. 

Notably, once the group-specific effect estimate fthe conditional precision estimate D 2 , 
and the dispersion estimate </>j have been computed, the procedure has no need for y i and Fj. 
This is both a strength and a weakness. It is a strength because it reduces the computation 
and the memory demands of the procedure, and it allows most of the operations to be trivially 
parallelized. The weakness in this data reduction is that it likely sacrifices statistical efficiency. 
On balance, as later we demonstrate in Sections [7] and [ 8 ] in many large-scale data regimes it is 
worthwhile to make this computational-statistical trade-off. 
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4. Weight choices 

4.1. Weighted, unweighted, and semi-weighted cases 

The estimators introduced in Section ph3| depend on a choices of weights Wi (i = 1,..., M). The 
choice that minimizes E\\0 W — (3\\ 2 is 

W i = (Vj 2 HV l2 + D- 2 )~\ ( 7 ) 

where S = In general, we do not know S and <j>, so we cannot use these weights. 

In the univariate case, Cochran discusses three practical alternatives. The first option, which 
he calls the “unweighted” method, corresponds to setting Wi = I r .. The second option, which 
Cochran calls “weighted,” corresponds to setting Wi = D~ . The last option depends on an 
initial choice So and corresponds to setting Wi = (Vj 2 ^oVi 2 + .D“ ) _1 ; Cochran calls this the 
“semi-weighted” method. Following Cochran and Swamy, we use a two-step estimation scheme, 
taking an initial choice of weights to get a preliminary estimate So of the scaled random effect 
covariance matrix, and then using this estimate with the semi-weighted method to choose a new 
set of weights, repeating the estimation process. For the initial choice of weights, we use the 
semi-weighed method with Sq chosen as specified in the following section. 


4.2. Optimal weights 

In this section, we will study the optimal weight choice. We do not give a complete analysis, 
but we will derive a heuristic choice based on minimax optimality considerations. We will show 
that, after standardizing the predictors, it is reasonable (and sometimes optimal) to choose the 
semi-weighted Wi with S 0 = I q . 

For i = 1 set = Vjfji. We will use a weighted combination of the estimators 

0i, ..., 6m to estimate /3. Let a = (cti, ..., olm) be a vector of weight matrices, where com¬ 
ponent matrix a.i has size r, x p. Define estimator 0 a = aJOj, which has expectation 

E(0 a ) = (£jli VnOLi^f3 and covariance cov(/3 a ) = J2iL\ Q! H cov (^i)} Q: i- For 0 a to be unbi¬ 
ased for all 0, we must have YliL i Va a i = Ip- 

Among all choices of a that make /3 a unbiased, the one that minimizes the mean squared 
error E\\j3 a — f3\\ 2 is the one minimizing trjcov)/^)}. Letting cx.,k denote the fcth column of a.i, 
the squared-error-optimal choice of a must satisfy the Lagrangian gradient equations 


{cov(0i)}a ifc = VTiojfc, 


with pxp Lagrange multiplier matrix = [wi • • ■ <jJ p ] . Thus, the optimal unbiased weight vector 

satisfies 

a* = {cov(e i )}“ 1 V'^ 1 0, 

with f 2 = i minimizing estimator 0 has cov(/3 ) = fl. 

The weight a* depends on the unknown quantity S = We would like to find a weight 

which is independent of these unknowns. To measure the sub-optimality of any particular choice 
of a , assume <fi = 1 without loss of generality, and define the risk function 

R( E,a) = tr{f2 _1 cov(/3 a )}. 

Ideally, we should choose the weights that minimize the maximum risk. In practice, it is difficult 
to solve the underlying optimization problem to find this set of values for a , so we instead will 
choose the weights a based on a heuristic. 
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Define extremal risks Ro(ot) and i?oo(a) as 

M M 

{ 2 n,} , 

i=l i= 1 

M M 

{ E ^i(n 2 V i 2 ) t n i }{E • 

i=l i=1 

Instead of finding a. to minimize sup s R(UjCx), we will attempt to find weights that minimize 
the average R(ol) = (Rq(ol) + R 00 (ct))/2. To this end, set 

M M 

B = E C = E VniVlV^Vl. 

*=1 i=l 

For ci: to minimize i?, while simultaneously satisfying the unbiasedness constraint, its itli com¬ 
ponent must satisfy the Lagrangian gradient equation 

D- 2 a,B + VJ 2 V i2 oaC = V^A 

for some p x p matrix of Lagrange multipliers, A, independent of i. In vector form, 

( B T ® D ~ 2 + C T ® Vj 2 V i2 ) vec(a,;) = (I p 0 V^) vec(A). 

The unbiasedness constraint VjiQj = J p must also hold. 

Finding <5 and A requires solving a linear system of p'YldLi r i + P 2 equations in as many 
unknowns. For general situations, this is computationally expensive. However, in the case of a 
hierarchical generalized linear models satisfying XEi XjXj = MI p and J, = for all i, we 
get the simplification B = C = MI p ; in this case, the optimal weight is 

cx i = (Vj 2 V i2 + D- 2 )- 1 Vj 1 ti, 

with Cl chosen such that ~ I- This corresponds to the semi-weighted case using 

So = I q . Motivated by this correspondence, in practical applications we will standardize the pre¬ 
dictors and then use the semi-weights with S 0 = I q . In addition to the optimality considerations, 
the standardization ensures that the procedure is equivariant. 


Roo{ot) = lim R(tlq,a ) = tr 

t—>■ OO 


Ro(ot) = lim R(tl q , a) = tr 


5. Finite sample properties of moment-based estimates 

5.1. Theoretical framework 

To analyze the performance of the proposed moment-based estimation procedure, we will need to 
be precise about what assumptions are required. To facilitate asymptotic analysis, we will state 
these assumptions in terms of sequences indexed by N. We make this dependence on N explicit 
in the assumption statements, but, to simplify the notation, will suppress this dependence in 
most of the text. 

Assumption 1. There exists a non-random p-dimensional fixed effect vector (3 and, for each 
value of N there is a sequence of M(N) independent and identically distributed q-dimensional 
random effect vectors: The ith random effect vector can be expressed as 
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uj\r t i = S^utv,* where S 1 ^ 2 is the symmetric square root of positive semidefinite matrix X, and 
the sphered random effect vector u;si,i satisfies the moment conditions 


E(u N ,i) = 0, 

(8a) 

COv(UTV,z) = Iq , 

(8b) 

E\\u n Y < ft 

(8c) 


for some finite constant ft. 

Assumption 2. For each N and all i = 1 ,..., M(N) there exists a matrix with orthonormal 
columns VN,i = [^/vti Vn aY> an d a symmetric positive-definite matrix DN,i (not necessarily 
diagonal) such that Vn,h an d VN,i2 have dimensions p x rjv,* and q x rjv,i> respectively, and 
Dn,i has dimension r^,i x Further, the following conditions hold: 

(a) The matrix VN,nV)srii invertible. 

(b) The matrix YN.aVjj a) ® Y NpYY 12) invertible on the subspace S q of vectors 

s satisfying s = vec (S) for some symmetric q x q matrix S. 

Assumption 3. Letting rj N i = [/ 3 T uJj i ] T be the true (p+q)-dimensional effect vector for the 
ith group, there exist group-specific effect estimates rj N 1 > • ■ ■ > Vn m(n) such that the estimation 
error = rj N i — r) N i satisfies the moment relations 

EYnYna) = 0, (9a) 

CO vYN,i h N,i) = </>-D)v 2 u ( 9 b) 

E\\cj)- 1/2 D Nti V T N YN,i\\ 4 < A (9c) 

for some dispersion parameter <t> and finite constant X. Furthermore, the estimation errors 
ft, tv. 1, ■ ■ •, ft_/v,M(iv) and the random effects mtv.Ij ■ ■ •, un,m(n) are mutually independent. 

Assumption 4 . For each N there exists a random dispersion parameter estimate <pN inde¬ 
pendent of the vectors ftjv,i, • ■ •, ftjv,M(iv) an d Un,i ,..., Utv,m(at) such that 

E($ N /<t> - l ) 2 < v/{N - p N ) ( 10 ) 

where pjy = Y)iLi r N,i < N and v < 00. 

These assumptions are motivated by the linear case introduced in Section |3.2| Assump¬ 
tion [2ja) ensures that (3 is identifiable; it holds if and only if the combined predictor matrix 
X = [X^ • • • X ' M \ V has full column rank; Assumption [2jb) ensures that X is identifiable; it 
holds if and only if YlnLiYJZi) ® ( ZjZi ) is invertible on S q . Assumption [ 3 ] holds for the 
hierarchical linear model whenever E |ejj| 4 < 00; for nonlinear models, including hierarchical 
generalized linear models, Assumption [3] will not hold exactly, but it will be a reasonable approx¬ 
imation whenever the group-specific sample sizes are large. For Assumption [4] in models where 
the dispersion parameter is known it suffices to take <^>tv = 4> and v = 0. 

Assumption 5 . For each N there exists a sequence of symmetric positive-definite weight 
matrices W tv,i, ..., W n,m(N) where the ith weight matrix has dimension Xn,i x rj\r,i and satisfies 
the relation 

W Nti (V T Nti 2 ?lV N>i2 + cfD- N 2 .)W NA A n N W N>i 

for some nonrandom sequence kn independent of i. 


( 11 ) 
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Table 1. Weight choices and associated bounding constants 


Method 

Wi 

K, 

Unweighted 

Ir, 

||£|| + fma,Xi\\Df 2 \\ 

Weighted 

D 2 

£ maxi \\D 2 | + f 

Semi-Weighted 

(V7 2 £o Va + DT*)- 1 

l|s ( 7 1 £||+</> 


Table |5.1| shows the bounding constants from Assumption [5] associated with each weight 
method discussed in Section [1J It is straightforward to derive these bounds for the unweighted 
and weighted cases. For the semi-weighted case, we derive the bound in Lemma [5.1| Generally, 
\\Di\\ will scale proportionally to the square root of the group-specific sample size, n i . We can 
see that the bound for the unweighted case degrades if some ni is small, while the bound for the 
weighted case degrades if some ni is large. The bound for the semi-weighted case is insensitive 
to the group-specific sample sizes. 

Lemma 5.1. //S 0 and D\, ..., Dm are positive-definite and £ is positive-semidefinite, then 
for the weight defined by Wi = (VJ 2 S 0 V^+Df 2 ) -1 , Assumption^ /icdds with n = ||£ 0 1 £|j+^>. 

PROOF. We will drop the subscript i for the proof of the lemma. First, note the relation 
D 1 WD ~ 1 = {DV 2 X 0 V 2 D + Ir )- 1 A I r , so that 

<PW 1 / 2 D~ 2 W 1/2 A <j)I r . (12) 


Next, use the matrix inversion lemma to express 

W = D 2 - D 2 VlfiZ 0 1 + V 2 D 2 V T 2 )~ l V 2 D 2 . 

Use the identities I — {A + B)~ 1 B = (A + B)~ X A and B(A + B) _1 = I — A(A + JB) -1 to get 

VJVVl = S " 1/2 {/ 9 - (J, + :L/2 - 

Employing the bound I — (/ + A) -1 A /, which holds for any positive-semidefinite matrix A, it 
follows that V 2 WV 2 A £q X . Thus, 


W 


1/2 


V^SV 2 W 


1/2 


d llsSo 1 !!/,.. 


The result of the lemma follows from (121 and rtl3|. 


(13) 


5.2. Existence 

For the estimates 0 W and Sg to be well-defined, we must have that the corresponding quantities 
ft and fl 2 are invertible. Propositions |5.2| and |5.3| show that this is always the case whenever 
the group-specific weights are positive definite and Assumption [2] is in force. 

Proposition 5.2. For i = 1 ,...,M let Wi be a nonrandom symmetric positive-definite 
matrix. If Assumption^ a) holds, then the matrix ft defined in (J3| is invertible, so that j3 w is 
well-defined. 

PROOF. The matrix ft is symmetric, so it suffices to show that it is positive-definite. We 
will proceed by contradiction. Suppose that the statement of the proposition is false, so that for 
some nonzero vector t, the identity t T flt = 0 holds. In this case, since Wi is positive-definite, it 
must follow that Vjyt = 0 for all i = 1,..., M. Thus, Y^fLi ii^i Vfyt = 0. This contradicts 
Assumption [2]( a). It must follow, then, that t T ftt > 0 for all nonzero t, so that ft has full rank. 
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We state the result for El 2 , which follows by a similar argument, as Proposition |5.3| The full 
proof of this result is given in Appendix^ of the on-line supplement. 

PROPOSITION 5.3. For i = 1 let Wi be a nonrandom symmetric positive-definite 

matrix. If Assumpt.ion^b) holds, then the matrix El 2 defined in |5| is invertible on S q , so that 
is well-defined. 


5.3. Concentration 

The next results, Corollary 
Siy are close to their estimands. 


5.5 


and Proposition 


5.6 


Proposition 5.4. If Assumptions [ 7 ] [^| and [ 5 ] 
relations 


show that with high probability, 0 W and 
are in force, then 0 W satisfies the moment 


E{0w) — 0i 
cov(0 w ) A 


(14a) 

(14b) 


is 


PROOF. We have 0 W = El 1 £^1 V’.iW’iV’T’ifc. Proposition^ shows if Assumption [| 
in force, then El is invertible and consequently 0 W is well-defined. Assumptions [l] and [ 3 ] imply 
that EfVjfji) = V7i/3, so that E{0 W ) = 0. Additionally, these assumptions together with 
Assumption [5] imply that 


M 


cov 


(0 W ) =El- 1 {J2v tl W t (Vj 2 ^V l2 + cl ) Di 2 )W l Vj 1 }El- 1 A rXT 1 . 


i=1 


COROLLARY 5.5. If Assumptions [7] [1[ [3|a7i(i[5] are in force, then for any e > 0, 

Pr{|| 0 W - 0\\ 2 > e -1 Ktr(n -1 )} < e. 

PROOF. From Proposition |5.2| it follows that 

E\\0 W - 0\\ 2 = E[tr{(0 w - 0){0 W - 00}] 

= tr{cov(/3 w )} 

< Ktr(Jl _1 ). 


Now apply Markov’s inequality. 

Proposition 5.6. If Assumptions [7] [ 5 ] are in force, then for any e € (0,1], 

Pr{||S w - S||| > e~ 2 k 2 C 2 tr(fl 2 1 )} < e, 

where C = {9p 3 / 2 + 3(A + 2) 1 / 2 + p 1 / 2 + n 1 / 2 ( N/p — l) -1 / 2 }/2. 

Proof. Define S analogously to S be replacing i) i with rj i . The triangle inequality implies 
that 

l|Sw - S||f < ||S(4iv) - S(/3)||f + II S(0) - S(0) - 0B\\ f + ||S(/3) - S|| F + \0 - 0\\\B\\ F 

We analyze the right hand side summands in Appendix [B] of the on-line supplement; Lemma |5.7| 
stated after the proof of Prop. |5.6| summarizes these results. 
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Fix any a > 0. Set u = Lemma [5.7p | shows that 

Pr(||S(/3 w ) - S(J 3)|| f > dap 3 ' 2 ™ 1 ' 2 ) < a~\ 
Lemma 5.7 b| and Markov’s inequality imply that 

Pr{||S(/3) - S((3) - (f>B\\ F > 3o(A + 2 ) 1 / 2 kw 1/2 } < a~ 2 . 


Similarly, Lemma |5.7[ [c| and Markov’s inequality imply that 

Pr{||S'(/3) - S|j F > ap 1/2 noj 1/2 } < a~ 2 . 

For the final term, Assumption [4] implies that 

Pr{| (j)/(f) — 1| > an 1 ^ 2 (N — p ) -1 / 2 } < 1/a 2 , 
and Lemma. |5.7[ [d|l implies that 

0||B||f < K / 9 1/2 ||r2 2 ' 1 || < up 1 / 2 w 1 / 2 . 

Thus, with probability at least 1 — (l/a + 3/a 2 ), 

|!£ w - S|| F < anoj 1/2 {9p 3/2 + 3(A + 2) 1 / 2 + p 1 / 2 + v 1 / 2 (N/p - 1)“ 1/2 }. 

Set £ = (1/a + 3/a 2 ). If e < 1, then a -1 = y/1 + 12£ — 1 > 2e. This gives the desired result. 
Lemma 5.7. If Assumptions [7J [1J [5[ and [5] are in force, then the following identities hold: 

(a) Pr/jlS/ZV) - S((3) || F < 9p 3 / 2 K{tr(flf 1 )} 1 / 2 /e} > 1 - e, 

(b) E\\S((3) - S(J3) - <f>B HI < 9k 2 (A + 2) tr^ 1 ), 

(c) -E7II <S'(/3) - £||| < pn 2 tifClf 1 ), 

(d) ||B|| F <r 1 «P 1/2 ||fi2 _1 || 1/2 > 

where p = r i- 


5.4. Near relative efficiency 

We now show that with the semi-weighted method, if the initial choice for So is close to the true 
value S = </ _1 S, then the weighted estimate is close to optimal unbiased weighted estimate. In 
this sense, it is close to being “relatively efficient”. 

To be precise about this equivalence in efficiency, let 6 o denote the vector with \ + q(q + \)/2 
components, gotten by concatenating cj> and the unique elements of S. For any parameter vector 
6 with the same dimension, let S# and fig denote the corresponding values of the random effect 
covariance matrix and the dispersion parameter. Set 

M 

&g = ng 1 J2 V * W <»Vlr) i , (15) 

i =1 


where 


M 

We, = (v/ 2 s e v i2 + D~ 2 )-\ n e = J2 W 0i , 

i—1 


( 16 ) 
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and Eg = <f g 1 Se- If Assumption [ 2 ] is in force, then Proposition 5.2 implies that 1 exists and 
0 g exists for all 9. Define 0, Wi, and ft as the quantities gotten by setting 9 = 9q. 

The next result states that f or all p arameter vecto rs 9 in a neighbourhood of 9 0 , the estimate 


0 g is uniformly close to 0. Carter and Yang (1986) state a similar asymptotic result in the 


context of Swamy’s estimation procedure; their heuristic proof of this result uses different but 
related techniques. 


PROPOSITION 5.8. Let B be any neighbourhood, of the true parameter vector 6 q. For any 
e > 0, if Assumptions [I}{3] are in force, then 

Pr{sup||O 1 / 2 (/3g( - 0)\\ 2 > Cs- 1 t 2 } < e, 
ggb 

where t = sup e6 g max{|jS 1 Se — I q \\, ||S e 1 E — -Tj||} and 


C = 9{48p 3 + A8pq 3 (l + Apr) 2 + 768pr 2 (l + Apr)}, 


with p = YhLi r i- 

Proof. For any vector 9, write 0 g — (3 = 1 YldLr Vii W Q0Vjf) i — Vj 1 0). Now, 

M 

0 g -0 = {0 e -0)-{0-0) = J2( n e 1 V* 1 W ei - H- 1 VaWMVjfii - V00). 

i =1 

Set 7 1 = W 1 i /2 (Vjfj i - V' n 0). It follows that 

M M 

0 e ~0= ( ft* 1 - ft' 1 ) Y V ll W 1,2 ll + ft g 1 Y VniWgi - W i )W~ 1/2 li . 

i= 1 i= 1 

Letting Egi = Vf 2 {^e — ^)Vi 2 , the identity (A + E )~ 1 — A~ x = —(A + E)~ 1 EA~ 1 implies 
that 

W 6l - W, = -W ei E ei Wi = -WiEoiWi + Wo.EnAV.EoAV,. 

With this identity, it follows that the scaled difference between the two estimates can be expressed 
as 

fl 1/2 (0 g - 0) = 809) + S 2 (9) + 609), 

where 

M 

609) = {ft 1 ' 2 ft, 1 ft 1 ' 2 - 10ft- 1 ' 2 J2 VaWl'%, (17a) 

i= 1 
M 

609) = -ft^fig 1 Y VnW l E ei W 1 J 2 li , (17b) 

2=1 
M 

6 3 (9) = ft 1 ' 2 ft, 1 Y V^W„,E 0 ,W,E () ,W) S,. 

2=1 

Further, if Assumptions [l] and [ 3 ] are in force, then A( 7 ,-) = 0 and cov( 7 j = I ri . 


(17c) 
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Lemma |5.9| stated at the end of Section |5.4| and proved in Appendix [C] of the on-line sup¬ 
plement, bounds the terms in ( |17[ ). This lemma implies that with probability at least 1 — £, the 
following three inequalities simultaneously hold: 

sup||( 0)|| 2 < 48£"Vt 2 , 

0gb 

sup|| J 2 ( 6>)|| 2 < 48e"W t 2 (1 + 4 pr) 2 , 

0GB 

sup||(0 )|| 2 < 768£ _1 pr 4 (l + Apr). 

0GB 


The result of the proposition follows since ||Jl 1 / 2 (/3 e —/3 )|| 2 < 9 {||< 5 i( 0 )|| 2 -|-||< 52 ( 0 )|| 2 -l-||^ 3 ( 0 )|| 2 }. 


Lemma 5.9. Let functions 8 i(Q), 82 ( 6 ), and 8 3 (0), be defined as in ( 17a) —(17c ). If Assump- 
tions fTl-T?] are in force, then for any e > 0 and any parameter set B, 


Pr{sup||£i(0 )|| 2 > 16£“VV} < e, 

0GB 

Pr{sup||5 2 (0 )|| 2 > 16£"Wt 2 (1 +4pr) 2 } < e, 

0GB 

Pr{sup ||^3 (0)|| 2 > 256£^ 1 pr 4 (l + Apr)} < e, 

0GB 


where r = sup egB max{ 


|S *£ e -I v 


is/s-jpll 


and p = Y^iLi r i- 


6. Asymptotic properties of two-step estimates 

In Section [5] we established finite-sample existence, concentration bounds, and near relative 
efficiency for moment based estimates. Given the finite-sample results, it is straightforward to 
derive asymptotic analogues of these properties in settings where the sample size tends to infinity. 
We will need an additional assumption on the bounding constants: 


Assumption 6 . The sequence of bounding constants kn defined in Assumption [5] satisfy 
lim sup w kn < 00 . 


Referring to Table |5.1| we can see that Assumption [ 6 ] holds for the unweighted case whenever 
||.D,:|j is bounded away from zero, and for the weighted case whenever ||Z?j|| is bounded away 
from infinity. For the semi-weighted case, Assumption [ 6 ] holds whenever So is positive-definite. 

In addition to assumptions on the bounding constants At at, the asymptotic results require 
conditions on ft and fl 2 - To state these conditions, we define the quantities 


LOn 


inf 

tSRp 


t T nt 

rT’ 


WJV, 2 


inf 


s' i T22'S 

s T s 


The quantity con is the smallest eigenvalue of f2; similarly, wat ,2 is the smallest eigenvalue of 
restricted the space S q . The asymptotic results require that lon and wat ,2 go to infinity at or 
above a specified rate. Typically, a necessary condition for wat ,2 to go to infinity is that M —> 00 . 
For example, in the unweighted and the semi-weighted case with S 0 >- 0, one can show that 
un ,2 = O(M); thus, for uin, 2 to diverge to infinity, it is necessary to have M —>• 00 . 

Our first result establishes that the moment-based estimators for (3 and S are consistent. 
This result follows immediately from Corollary |5.5| and Proposition |5.6| 
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Proposition 6.1 (Consistency). If Assumptions mu are in force, then the asymptotic 
limits of (3 W and £ w are determined as 

(a) If LOjsf —r oo, then $ w A /3. 

(b) If ujn, 2 {N/pn — 1) — > oo and ujn ,2 —> oo, then £vv A £. 


Next we establish that the two-step estimate for (3 is relatively efficient. To state this result, 
as in Section [7] let /3 be the moment-based estimat e of (3 with variance-minimizing weights 


6.2 


from ([7|, and let /3 0 be as defined in (151. Proposition 
is asymptotically as efficient as /3. This result follows from Proposition 
Appendix [D] of the on-line supplement gives a complete proof. 


shows that the two-step estimator f3 : 

and Proposition 


5.6 


5.8 


Proposition 6.2 (Relative efficiency). For each N, suppose that W n, i, ...,W n,m(n) 
are weights with bounding constants kn satisfying Assumption [h| Set 6 = (</>, Svv)- Suppose 
that Assumptions [7J|5| are in force and that £ >- 0. If pn —> oo, (N — pn) log Pn —> oo, and 
(vnj/Pn) log pw -t oo, then Cl 1/2 ($g -j9)4o. 


The next two results show that the two-step estimator (3g is asymptotically normal. 

PROPOSITION 6.3. Suppose that Assumptions [7[[5] are in force. Let j3 denote the weight- 
based moment estimate with variance-minimizing weights Wi as in Eq. ([7]). If M —>■ oo and 
i\W — > 0, then S~2 1//2 (/3 — /3) converges in distribution to a mean-zero multi¬ 
variate normal random vector with identity covariance matrix. 


PROOF. By the Cramer-Wold device, it suffices to show that for any unit vector t , the 
quantity Y = t T S~2 1//2 (/3 — f3) converges in distribution to a standard normal random variable. 
For i = 1,..., M, define 

Xi = t T n- 1 / 2 v il w i (v^ i - vi/3) = m-WvaWiivjhi + vf 2 uf), 


so that Y = J^iLiXi- It follows that E(Y) = 0 and var(P) = 1. If we can show that 
IZiLi E(Xf) —> 0, then Lyapunov’s Theorem will ensure that Y converges in distribution to 
a standard normal random variable, the desired result of the proposition. 

By the Cauchy-Schwarz inequality, 


x? < II n-WVaWiiVjhi + VluM 2 < \\^- 1/2 v tl w] /2 \\ 2 \\w] /2 {vjK + VI 


Uj ) 


Therefore, it follows that 

E{Xf) < wn-^VawyYswwy^vfhi + Vj 2Ui )W 4 . 

One can write ||r2 _1//2 y ) iLF]^|| 4 = nWiVj^W 2 . From Assumptions [l] and [ 3 J it fol¬ 

lows that E\\Wy 2 (Vjhi + Vj 2 Ui) || 4 < C for some constant C independent of N. Thus, if 
E^ill^-V^nil 4 ** 0; then J2iLiE{Xf) —> 0, and hence Y converges in distribution to 
a standard normal random variable. 


are in force, then the vector El 1 ' (j. 3g — (3) converges in distribution to a mean-zero multivariate 
normal random vector with identity covariance. 


Corollary 6.4 (Asymptotic normality). If the assumptions of Propositions^^ and 6.3 
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7. Performance in simulations 


To evaluate the performance of the moment-based estimators in practice, and to compare these 
estimators to their likelihood-based counterparts, we perform two simulation studies: one for a 
hierarchical linear regression model, and one for a hierarchical logistic regression model. This sec¬ 
tion describes the logistic regression simulation; Appendix [F] of the on-line supplement describes 
the linear regression case. Both simulations exhibit similar behaviors. 

We set the number of groups to M = 1000 and simulate N samples, with N ranging from 
100 to 100000. We set the dimensions of the fixed and random effect vectors to p = q = 5. For 
each value of N we draw 100 replicates according to the following procedure. 

For each replicate, we draw a p-dimensional fixed effect vector (3 with components /3k, k = 
1,... ,p drawn independently from a t distribution with 4 degrees of freedom. We draw random 
effect covariance matrix X from an inverse Wishart distribution with shape I and 2 q degrees of 
freedom, scaled by 0 . 1 . 

Rather than splitting the N samples evenly across all M groups, in each replicate we draw 
population-specific sampling rates A , (i = 1,..., M) as independent exponential random variables 
with mean N/M. Then, we allocate the N sample points by drawing from a multinomial on M 
categories with probability of category i proportional to AThis sampling scheme is equivalent 
to drawing n \,..., tim as independent geometric random variables with mean N/M , conditional 
on their sum being N\ it gives rise to a highly skewed distribution of sample sizes. 

For each group i = 1,..., M, once iii has been determined we draw a random effect vector iq 
as multivariate normal random vector with mean zero and covariance X. We draw random 
population-specific fixed effect predictor vectors Xij for j = 1 ,... ,m with independent elements 
such that Pi(xijk = +1) = Pr (xijk = —1) = 1/2 for k = 1,... ,p. We use the same procedure to 
random effect predictor vectors z t] . Finally, for j = 1,... ,rij, we draw response variate yij as 
Bernoulli with success probability fi l:l = logit ~ 1 (xP(3 + zJ^Ui). 

We use a variety of methods to compute estimates of the population parameters (3 and X, 
along with plug-in empirical Bayes estimates group-specific random effects Ui, i = 1 ,... ,M: 


(a) 


(b) 

(c) 

(d) 

(e) 


mhglm, the proposed moment-based estimation procedure. To compute the moment-based 
estimates, we use two-step estimators with semi-weighted initial step and Xo set to the 
identity matrix, after standardizing the predictors. The procedure is implemented in the 
R programming language. 

glmer, maximum likelihood, using a gradient-free optimization procedure applied to an 
order-0 Laplace approximation to the profiled likelihood, implemented in C++ and R by 


2013). 


the lme4 R package (Bates et al. 
sgd, which uses stochastic gradient descent to maximize a regularized version of the h- 
likelihood (described in detail in Appendix |E| of the on-line supplement). The compute¬ 
intensive inner loop is implemented in C, and the outer loop in R. 

glmer split , a data-splitting estimation procedure, which splits the data set into 10 subsets, 
computes separate estimates for each using glmer , and then combines the estimates by 
averaging them. Implemented in R. 

glmmPQL , penalized quasi-likelihood, as implemented by the MASS package by iteratively 


calling the Ime fitting procedure (Venables and Ripley 2002 1 . 


We report serial computation time for each procedure, and we do not include cross-validation 
time for the tuning parameter selection for the sgd method. 

To evaluate the performances of the estimators, we use \\(3 — /3 || 2 for the fixed effect loss, 
tr{(XX _1 — J) 2 } for the random effet covariance loss, M _1 X/i'= 1 l|X~ 1 ' / 2 ('i+: — Ui )|| 2 for the 
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Fig. 2. Performance for the hierarchical logistic model. Circle radii indicate one standard error along 
y -axis (absent when smaller than line width). 
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random effect loss, and 2N~ X log{Pij/Pij) + (1 - Pij) log{(l - /%)/(l ~ Pij)}] for 

the prediction loss, where Pij = logit -1 {xP/3 + z^up and = logit -1 (®T/3 + zjjiii). 

Fig-11 shows the mean loss, averaged over all replicates, with circle radii indicating standard 
errors along the vertical axes (absent when less than the visible line width). For moderate 
to large sample sizes, there is a noticeable loss in statistical efficiency between the proposed 
method ( mhglm ) and the methods based on maximum likelihood (glmer and glmmPQL). Still, 
the proposed method appears to be consistent. Moreover, in terms of prediction loss, it performs 
better than glmer split and sgd. 

The lower-left panel of Fig. [2] shows the sequential computation times for all methods. For 
the largest values of N tried in the simulation, the proposed method is faster than the exact 
and approximate maximum likelihood procedures by factor ranging from 10 to 100. Without 
including cross-validation time, the sgd method is faster than all other methods tried in the 
simulation. 

In this simulation, it appears that the sgd method trades substantial statistical efficiency 
for improvements in computational efficiency. The proposed mhglm method makes a similar 
trade-off, but delivers noticeably higher statistical efficiency. 


8. Application to recommender systems 


8.1. Motivation 

To demonstrate the potential utility of the proposed moment-based estimators, we apply them 
to a large-scale recommender system application. Specifically, we use them to fit a hierarchical 
model to the MovieLens 10M dataset: the N = 10000054 ratings of M = 69878 users for 10681 
movies (GroupLens 2009). Using a moment-based estimation procedure to fit a hierarchical 


model to this dataset required approximately 10 minutes of serial computation time; the glmer 
method required approximately 9 hours to fit the same model. In Sections |8.2f|8.3| we demon¬ 
strate the ability of a hierarchical model, fitted using moment based estimation, to estimate user 
preferences and predict user ratings. 


8.2. Estimating user preferences 

One goal with a recommender system is to estimate user-specific preferences. This information 
can be used to characterize the user population and to cluster the users into meaningful groups, 
possibly for targeting promotions or advertisements. Formally, we represent a user’s preferences 
by a vector of coefficients which relate observable covariates to the user’s ratings. We will try to 
estimate these user-specific coefficients from the available movie rating data. 

Each rating consists of a user, and movie, a time, and a star value between 0 and 5. We 
binarize the ratings, then use a logistic regression model to relate the binarized ratings to review- 
specific predictors. We use the same predictors for the fixed and random effects, so that the model 
reduces to a random coefficient model. Letting /3 i = (3 + itj be a user-specific coefficient vector 


(fixed plus random effect), the model specifies logit Pr(?/,;_,■ = 1 | Xij,up = xP/3 + 


Pi 

is a set of 


x-jUi = x; J 


where yij indicates whether or not rating ij is favourable (at least 4 stars) and x l0 
rating predictors. 

Our first set of predictors encodes the genre of the movie being rated. The remaining rating- 
specific predictors are motivated by intuition derived from the BellKor movie recommender sys¬ 
tem (Koren 2009). One predictor, Popularity^- captures the current popularity of the movie 
being rating. The other predictor, Previous^-, indicates whether or not the user’s previous rating 
was positive; Table [872] describes these predictors in detail. 
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Table 2. Predictors associated with review ij 


Predictor 

Description 

Genreij 

A 4-component vector with movie-specific genre scores for Action, Chil¬ 
dren, Comedy, and Drama of the rated movie. Movies belonging to mul¬ 
tiple genres have fractional scores for individual categories. We use effect 
coding, so that the coefficients for the 4 genre components sum to zero. 

Popularity^. 

A robust estimate of the logit of the current popularity of the rated movie, 
computed from recent ratings of the movie: logit}}/;.? + 0.5)/(n;j + 1.0)}, 
where Z;j is the number of users who recently liked the movie and riij is 
the number of recent reviews of the movie. Here, “recent” reviews of the 
movie are the 30 or fewer most recent reviews at the time of rating ij. 

Previous;., 

An indicator of whether or not user i gave a favourable star value (> 4) 
in his or her previous rating. This predictor is designed to capture the 
user’s current overall mood. 
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Fig. 3. Empirical Bayes coefficient estimates for the 26884 users with at least 100 reviews. 
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Fig. 4. Misclassification rate for each user, i, aggregated by group size, rn. 


We assume a hierarchical model for the coefficient vectors with E(f3 i ) = [3 and cov(/3 i ) = E 
for * = 1,..., M. We use moment-based estimators for f3 and S computed from all N ratings, 
and then compute approximate empirical Bayes estimates for (3 i {i = 1 ,... ,M) assuming that 
the coefficients come from a multivariate normal population. Fig. [3] shows the one- and two- 
dimensional marginal distributions of the empirical Bayes coefficient estimates for those users 
with at least 100 ratings. In the two-dimensional marginals, contour lines show approximately 
38%, 68%, 87%, 95%, and 99% of coefficient pairs; these lines should be elliptical and evenly 
spaced for bivariate normally-distributed pairs. For the most part, the bivariate distributions 
look approximately normal, excepting the coefficient of Previous^-. 

By looking at the associations between the estimated coefficients, we can conclude that 
(a) affinity for particular genres appears unrelated to the intercept, which encodes a user’s overall 
tendency to give positive ratings; (b) users who like action movies tend to dislike children’s and 
drama movies, users who like children’s movies tend to dislike other genres, and users who like 
drama movies tend to dislike action and children’s movies; (c) users who like action movies tend 
to prefer unpopular movies, and users who like children’s movies tend to prefer popular movies; 
(d) users who tend to give ratings similar to their previous ratings do not tend to have prefer¬ 
ences for particular genres. Not only does the hierarchical coefficient model allow for a diversity 
of user preferences (encoded in regression coefficients), it also reveals associations between these 
preferences. 


8.3. Predicting user ratings 

Often, the primary goal of a recommender system is to predict item ratings. For this task, one 
advantage a hierarchical method holds over competing methods is its ability to borrow estimation 
strength across similar users, often obtaining better estimates than a model which estimates 
user-specific coefficients in isolation. To demonstrate this ability, we compare the out-of-sample 
prediction performances of three models: a “global” generalized linear model, using a single 
coefficient vector for all users, estimated by Firth’s penalized maximum likelihood; a “local” 
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generalized linear model, which uses separate coefficient vectors for all users, independently 
estimated with user-specific data and penalized maximum likelihood; and a hierarchical logistic 
regression model, which uses approximate empirical Bayes posterior means of the coefficients in 
the hierarchical model. We fit the hierarchical model using three different methods: moment- 
based estimation ( mhglm ), maximum profile likelihood ( glmer ), and stochastic gradient descent 
( sgd). 

We randomly split the reviews into 50% for a training set and 50% for a test set. We fit 
all three models on the training set, then use the fitted models to predict the values in the test 
set. Fig. [1] shows the misclassification loss performances of the fitted models on the test set for 
each user i, aggregated by group size, rq. The lines shows the averages, and the radii of the 
circles indicate standard errors along with y-axis. All three fitting methods for the hierarchical 
models perform comparably. The hierarchical methods uniformly beat the local and the global 
models. By combining the flexibility of the local model with the stability of the global model, 
the hierarchical model is able to outperform both extremes. 

9. Discussion 

We have extended Cochran’s moment-based estimators to general hierarchical models. Unlike 
other extensions, our proposal allows for both fixed and random effects, and it accommodates 
rank-degenerate predictor matrices. The proposed estimation procedure has three main proper¬ 
ties which make it appealing in large-scale data regimes. First, the procedure does not rely on 
strong distributional assumptions. Second, even when distributional assumptions are in force, 
in large sample settings the method can exhibit estimation and prediction performance com¬ 
parable to likelihood-based estimators. Finally, and most importantly, the method has good 
computational performance, sometimes 10 to 100 times faster than existing maximum likelihood 
procedures. 

We have analyzed the proposed method, both theoretically and empirically. We have shown 
that, subject to mild regularity assumptions, the moment-based estimation procedure is con¬ 
sistent. Moreover, the two-step estimation procedure is asymptotically relatively efficient and 
asymptotically normal, facilitating inference for the fixed effect vector. 

The assumptions required for the theoretical results hold for most hierarchical linear mod¬ 
els. However, for hierarchical generalized linear models, these assumptions will only be good 
approximations when the group-specific sample sizes rii are large; when this is not the case, 
the theoretical consistency results will no longer apply. In Sections [7] and [8] we demonstrate 
that even without theoretical guarantees, the proposed method can perform well. It is an open 
question to derive exact theoretical conditions to guarantee that the moment-based estimators 
for hierarchical generalized linear models are consistent. 

It is natural to ask if the moment-based estimators discussed in this article can be extended 
to handle more general models. For more general hierarchical models with additional levels of 
hierarchy, this extension seems feasible, but implementing this procedure in practice and deriving 
the appropriate theoretical conditions to guarantee consistency will require some finesse. 

To extend the proposed estimators to more general mixed models with non-nested random 
effects, it is not obvious how to proceed. We rely crucially on the ability to get conditionally 
independent subpopulation-specific coefficient estimates. This is likely impossible with crossed 
random effects. In our recommender system application, we were able to obviate the need for 
item-specific random effects by introducing a data-dependent predictor to capture item popular¬ 
ity. While this is not a perfect solution, it falls within our modelling framework, and it is simple 
to implement. It is likely that similar predictors can be used in other contexts where one would 
normally use crossed random effects. 
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As data volumes continue to outpace computational capacity, it becomes increasingly advan¬ 
tageous to trade statistical for computational efficiency. This is sometimes difficult, and it is only 
achievable if computational demands are a primary concern throughout the development of the 
methodology. We have demonstrated that when using moment-based estimates for hierarchical 
models, it is sometimes possible to gain substantial improvements in speed without sacrificing 
too much estimation performance. 
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A. Proof of Proposition 5.3 

Note that the matrix Wi®W\ is symmetric and positive-semidefinite. Using the decomposition 

M 

^2 = ® v ^)(W t g> Wi)(V a <S> V l2 ) T , 

i—1 

it follows that fl- 2 is symmetric and positive semidefinite. Further, one can show that S q is an 
invariant subspace of fi 2 . Applying Theorem 8.1.9 from Golub and Van Loan (1996), it suffices 
to show that s T Q 2 s > 0 for all s £ S q . Suppose that the converse is true, so that there exists a 
nonzero vector s with s T fi 2 s = 0. Let S be the q x q matrix with vec(S) = s, so that 

M 

0 = s T f 2 2 s = £ \\wl / 2 v i 2 svj 2 wyx, 

i =1 

where || • ||p denotes Frobenius norm. It follows that Vj 2 SVi 2 — 0 for ^ i — 1 In 

particular, 

M M 

o = y, wvusvui = ^ T { E( y ^n 2 ) ® (v i2 vj 2 )} S 

<=i »=i 

This contradicts assumption 2. Therefore, we must have s T tt 2 s > 0 for all s £ S q , so that fi 2 is 
invertible on S q . 

B. Proof of Lemma 5.7 

We prove Lemma 5.7 parts (a)-(d) through a series of smaller lemmas, labeled as Lemmas B.2- 
B.5. These smaller results rely on a matrix version of the Cauchy-Schwarz inequality, which we 
state and prove as Lemma B.l. 

Lemma B.l (Matrix Cauchy-Schwarz inequality). Suppose matrices Ax, ..., A M and 
Bi, ..., Bm are such that Ai has dimension p x r* and Bi has dimension q x n. Define block 
matrices A=[A[ ■■■ A T M Y and B = [B\ ■■■ B T M ] T . Then, 

\\A T B\\ < \\A t A\\ 1/2 \\B t B\\ 1/2 . 
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PROOF. We first will prove a special case of the result assuming that A T A = I p and B T B = 
I q . Let A T B = UDV T be a compact singular value decomposition, where the left and right 
singular vector matrices satisfy U T U = V T V = I r , and D is an r x r diagonal matrix with 
positive diagonal entries. Note that ||A T i?|| = |j£)|j. Define A = AU and B = BV. Then, it 

~T ~ ~ T ~ 

follows that A A = B B = I r . Hence, 

0 ^ (A - BY [A — B) = 2(1 - D). 


Thus, ||D|| < 1. 

In the general case, let A = U A D A V\ and B = U B D b V t b be compact singular value 
decompositions. Then, 

A T B = V A D A U T A U B D B V T B , 

and so 

||A T B|| < \\D A \\\\D B \\\\U T A U B \\. 

The special case of the result shows that ||[/^[/b|| < 1. The general result follows from the 
identities ||-Da|| = ||.A T .A|| 1 / 2 and ||Db|| = ||.B T .Bj| 1 / 2 . 

Lemma B.2. If Assumptions 1, 2, 3, and, 5 are in force, then with probability at least 1 — e, 

\\S(0 W ) - 5(/3)|| f < 9p 3 / 2 «{tr(f2 2 " 1 )} 1 / 2 / £ . 

PROOF. Define g = (3 — /3 and note 

(Virji - VaPwHVifji - VaPwY = (V^ - VaPYV^ - V a p ) T + V'fgg' V,. 

+ - V n PY + (Vifji - V tl p)g T V A . 

Thus, the difference of the weighted sums can be written as 

vec{A($ w )} - vec{A(/3)} = Ai + A 2 + A 2 , 


where 

M 

A, = 53 V i2 W i Vj 1 gg' F V il W i Vj 2 , 

i=1 
M 

A 2 = 53 VaWiVjtfiVifji - VaPfWiVl. 

i=1 

Defining g = il 1//2 g, it follows that 

M 

vec(A 1 ) = {Y. V * W > V *® V * W ‘ V l}^®nr 1/2 vec(gg T ). 

i =1 

One can show that 

M - 1/2 

(n 0 ft)~ 1/2 d { 53 ViiWiVl ® VaWiVl} 

i =1 

Applying the matrix Cauchy-Schwarz inequality (Lemma B.l), we get that 

|!n7 1/2 vec(A 1 )|| < ||vec(gg T )|| = ||g|j 2 . 
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From Proposition 5.4, we have that £i||g|| 2 < pn. Markov’s inequality implies that for any a > 0, 

Pr(||g|j 2 > apn) < 1/a. 

Thus, there exists an event G a with probability at least 1 — 1/a on which ||<j|| 2 < apn', on this 
event, the bound 

vec (Ai)|| < apft||fi./ 1 || 1 / 2 < apK{tr(r2^' 1 )} 1 - /2 

holds. 

The term Ao can be written in vector form as 

v 

vec(A 2 ) = 

k=l 

where with e the kth standard basis vector, and 

M 

S' 2 k = 'EiVuWi) ® (VuWiVaSl- 1 ' 2 )™c{e k (V^ - V tl (3Y}. 

i=l 

Further, Assumption 3 implies that E{ 82 k) = 0 and 
M 

CO v(6 2k ) = Y,{ V v W ^ V i2 i:V * 2+( f ) Di 2 )W l Vl}®(V i2 W i V il n- 1 / 2 e k ein- 1 / 2 VlW i Vl). 

2=1 

WyWan-^ekein-^VlW 1 / 2 H w\ / 2 V il ST 1 Vl 1 W \ /2 < I rt , 

Assumption 5 implies that cov(^ 2 fe) ^ rt£l 2 , and so 

E\\nYs2k\\ 2 < Kti(EtY)- 


Markov’s inequality implies that 

Pr{||0^ 1 <5 2 fe|| 2 > apn ti/fi/ 1 )} < l/(ap). 

Since this holds for all k = 1,... ,p, it follows that there exists an event B a with probability at 
least 1 — 1/a on which 

p 

^j|ll/ 1 ^2fe|| 2 < ap 2 nti'(fl 2 1 ). 
i —1 

The matrix Cauchy-Schwartz inequality (Lemma B.l) implies that on G a D B a , 

HO^" 1 vec(A 2 )|| < (apK) 1 ^ 2 [ap 2 Ktr(r2/ 1 )] 1 / 2 = ap 3 / 2 K{tr(fi^" 1 )} 1 / 2 . 

A similar argument shows that there exists an event B' a with probability at least 1 — 1/a such 
that on G a n B' a , the norm Ufi/ 1 vec ( 2 ) 11 is bounded by the same quantity. 

Finally, the triangle inequality impliest that on the event G a n B a n B' a , we can bound the 
norm of the difference as 


\\S0 W ) - S(0)\\f < 3ap 3 / 2 «{tr(f2/ 1 )} 1 / 2 . 


This event happens with probability at least 1 — 3/a. Thus, the result of the lemma follows by 
setting a = 3/e. 
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Lemma B.3. Define matrix-valued, function S analogously to S, replacing f) i by r/i in the 
definition of S. If Assumptions 1, 2, 3,and 5 are in force, then 

E\\S(/3) - S(f3) - 4 >B\\l < 9 k 2 (X + 2) tr^ 1 ). 


Proof. Note that - V^/3 = and 

(Vjnt - VJJ3)(Vjr)i - VIPY = VT 2 UiuTV ia 


- Vj 2 UihJVi - VjhiUjV i: 


V;h,h V,. 


From Assumptions 1 and 3 imply that 

E(Vj 2 UihJVi) = 0 , 

EiVjhihjVi) = 4>D~ 2 , 
cov{vec (VlmhfVi)} = (V? 2 EV i2 ) <g> (IpDf 2 ), 
cov{vec (D l Vjh t hjV. l D. l )} A \{cfl ri ) ® (0J ri ). 


Also, Wi(<l>Dr 2 )Wi A nWi and W l Vj 2 ’EV l2 Wi A kW*. Using the identities E\\x + y + z\\ 2 < 
9(£’||a;|| 2 + .E||y|j 2 + -E||z|| 2 ) and £7||cc|| 2 = tr{cov(a;)} + ||.E(£c)|| 2 , the result follows. 

Lemma B.4. If Assumptions 1 and 2 are in force, then E{S((3)} = E and 

E\\S(/3) — E||p < pK 2 


Proof. Write 

M 

AQ3) = J2 VaWiVl^UiuJX^VizWtVl, 

i=1 

Since E(uiuJ) = I q , it follows that E{S(f3)} = E. Note that 

cov{vec(’Ui’U^’)} A A||ttj|| 4 / A pi. 

Using this relation and the fact that WjV^EV^Wi A nWi it follows that 

M 

cov[vec{S(/3)}] A pny 4 { ^(VaWiVJtVVaWiVJt) ® (V t 2 W l Vj 2 i:V l 2 W t Vj 2 )}llf 1 

i =1 

M 

< PK 2 ny{ J2( v ^ w iVj 2 ) ® (VaWiVl)}^ 1 

i= 1 

= pn 2 ^ 1 . 

The result of the lemma follows. 

Lemma B.5. If Assumption 5 is in force, then 

iiBii F <r i «P i/2 n^ i ir /2 , 

where p = r i- 
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Proof. Write 


M 

vec (*) = £ Ai vec(I r .), 

i— 1 


where A, = ® V^W.Df 1 ). From Assumption 5, it follows that 


M 

AfAj A 0- 2 k 2 «2 1 . 

i=l 


From Lemma B.l, it follows that ||vec(B)|| 2 < </> 2 ac 2 ||S~2 2 1 |j ^f=ill vec (-^n)l| 2 = ^ 2 «: 2 p||0 2 1 ||. 
This gives the result of the lemma since ||-B||p = ||vec(IJ)||. 


C. Proof of Lemma 5.9 

We prove Lemma 5.9 in a series of smaller lemmas, labeled Lemma C.2-C.4. These results rely 
on a set of bounds derived as part of the following result. 

Lemma C.l. Consider the weight matrices Wgi and fig defined in (16), with W* and ft 
defined by setting 9 = 9 o. Set Egi = Vj 2 (Hg — S)V i2 . The following identities hold: 

Wg, - Wi = -WgiEgiWi , (la) 

M 

fig 1 -n 1 = n e l ( K Y J Wg l Eg t w^fi-\ (lb) 

i =1 

along with the inequalities 

WWiEgiW <4||S” 1 S 0 -I 9 ||, (2a) 

lin^n-ipll <4p||s” 1 s e -/ 9 ||. (2b) 

PROOF. Equation (la) follows from the identity ( A + E)~ l — AT 1 = —( A + E)^ 1 EA ^ 1 , 
which holds whenever A and (A+E) are invertible. Applying the identity again to fig gives (lb). 

To show (2a), we use that if A and B are symmetric, A is positive-definite, and B is positive 
semidefinite, then the following holds: 

|| (V T AV + D~ 2 )~ 1 V t BV || = ||(V T AV + D^ 2 )- 1 y T A 1/2 (A' 1/2 BA" 1/2 )A 1/2 y T || 

= ||(V T AV + £)- 2 )- 1 / 2 F t A 1/2 (A- 1/2 BA' 1/2 ) 1/2 || 2 

< WA^BWW^AV + D- 2 )~ 1 V t AV\\ 

= ||A _1 B|||| J - {V T AV + TT 2 )' 1 !^ 2 1| 

< 2||A _1 B||. 

If B is symmetric but not positive semidefinite, then the matrix C = AT 1 ^ 2 BA^ 1 ^ 2 might not 
be positive semidefinite, so C 1 ^ 2 may not exist. In this case, C can be written as C = C + — C_, 
where C+ and C_ are positive semidefinite. Then, since max||{||C+, ||C_||} < ||C|| it follows 
that 
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The last inequality (2b) follows from (lb) and (2a) using the bound 


M 


M 


J2 n^WotEoiWi < maxH^^U \VT e l W ei 


i= 1 
M 


< max 

i =1 

= iasx\\E ei W t \\ tr(Ip). 


Lemma C.2. Let function Si(9) be defined as in (17a). If Assumptions 1-3 are in force, 
then for any e > 0 and any parameter set B, 

Pr{sup||<$i(0)|| 2 > 16 e~ 1 p 3 T 2 } < e. 

8&B 


where r = sup egB ||S l E 0 -I q \\. 

PROOF. Set m = J7 -1 / 2 n °ting that E(m) = 0 and cov(m) = I p . It 

follows that i?||m|| 2 = p, so by Markov’s inequality, Pr(||m|| 2 > £~ 1 p) < £. Since 

||<M0)|| = \m 1 / 2 n e l n l/2 -i p )m\\ < ung'n-ZpiiiHi, 
the bound follows from Lemma C.l. 

Lemma C.3. Let 62 ( f )) be defined as in (17b). If Assumptions 1-3 are in force, then for any 
£ > 0 and any parameter set B, 

Pr{sup|!^2(0)|| 2 > 16£^ 1 pg 3 r 2 (l +4pr) 2 } < e. 
eeB 

where r = sup egB ||S l E 0 -I q ||. 

PROOF. Define the function 

M 

m( 0 ) = n - 1/2 Y, V, W,Eo,W ] ; V 

i =1 

Note that m(0 o ) = 0. Also, for any fixed 0 , E{m(0)} = 0 and by Lemma C.l, 

M 

CO v{m(G)} = n~ 1 / 2 {Y V ^ W i E ^ W i W i E ^ V ii } n ~ 1/2 1 16 P~% - 1,11%. 

i =1 

Thus, Markov’s inequality implies that for any fixed a > 0 and any fixed 0, 

Pr{||m(0)|| 2 > a} < lea^HS" 1 ^ - I q || 2 . (3) 

We will use this pointwise bound and the fact that m( 6 ) is linear in E 0 to bound the supremum. 

For k = 1,..., q, define E^k = £ (I + efce£)E , where e/ c denotes the fcth standard basis 
vector. Similarly, for l f k, define 'Em = E (I + e^ej + e%)£ . For 1 < k < l < q choose 

a 6 m such that E 0kl = £. For any parameter vector 6 , define 

t k i(0) = el(E~ 1,2 E e E^ 12 - I q )e V , 
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it follows that 

Se = S + ^2 tki(0)(Eki ~ S). 

k<i 

In particular, since m{6) is linear in Se, this implies that 

m(0) = 5 ~2 t ki{0)m(0 k i ). 

k<l 

Thus, the matrix Cauchy-Schwarz inequality (Lemma B.l) implies that 

\\m(6)\\ 2 < [^{<fci(0)} 2 ] [^j|m(0 fc ,)|| 2 • 
k<l k<l 

The first term on the right hand side is bounded by ||S 1 Sg — I q \\p which in turn is bounded 
by g||S 1 Sq — I q \\ 2 - The second term is bounded by q 2 maxfc<;||m(0fc/)|| 2 . 

From (3), for any a > 0, 

Pr{max||m(0fcz)|| 2 > a} < 16a~ 1 pq 2 . 
k<l 

Thus, if we set a = 16 e~ 1 pq 2 , then 

Pr{sup||m(0)|| 2 > 16£ _ V? 3 sup||S 1 fi 0 - I q || 2 } < e. 

0gb ogb 

Finally, Lemma C.l and the triangle inequality imply that 

||fi 1/2 jVn 1/2 || = un^nii < i + 4p||E _1 s 0 - i q \\. 

Putting these two bounds together gives the result of the lemma. 

Lemma C.4. Let function 83 ( 0 ) be defined as in (17c). If Assumptions 1-3 are in force, then 
for any £ > 0 and any parameter set B, 

Pr{sup||<$3(0)|| 2 > 256£ -1 pT 4 (l + Apr)} < e, 

9&B 

where r = sup eeB max{|jS 1 S e - I p ||, - Ip 11} and p = YnLi r i- 

Proof. Define 

M 

m(6) = n e 1/2 J2 V il W 0i E 0i W i E 0i W 1 i /2 'y i . 

i=1 

Applying Cauchy-Schwarz (Lemma B.l) and Lemma C.l, we get 

M 

\\m(e)\\ 2 < Y J ll^ 1 i /2 E 0i W i E 0i W 0i E 0i W i E 0i w] /2 li 

i =1 

M 

<max||FF i F; ei || 2 max||FF 4/2 £; ei FF 4 f|| 2 ^||7,:|| 2 . 

I l L - 

i= 1 
M 

< 256||E~ 1 S e - I,|| 3 ||S« *£ - I,|| ^||7 4 || 2 - 

2=1 
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Since E(J2iLi\\'Yi\\ 2 ) = p, Markov’s inequality gives that 

M 

Pr-Ell^H 2 - £ ”V} - £ ‘ 

i= 1 

Also, Lemma C.l gives that ||S~2 1 // 2 0 6> 1//2 || 2 = |jr2r2g 1 || < (l + 4p||X 1 Se —/ p ||). This gives the 
result of the lemma since 63 ( 6 ) = f l 1 ^ 2 Cl 9 1 ^m(d). 

D. Proof of Proposition 6.2 

Take r A to be any sequence of positive real numbers, and define the parameter set Bn = {6 = 
((/> 6 >,£e) : ||S 1 Yig - I q \\ < tn and || 1 S — I q \\ < iyv}, where S e = Assumption 4 

and Proposition 5.6 imply that if (N — pnY^tn —> 00 and {1 + {N/pn — l) _ 1 / 2 } w jV 2 r N “^ °°> 
then Pr(0 £ Bn) —> 1. If, in addition, tn —> 0 and p A r A —> 0, then Proposition 5.8 implies that 
n 1/2 0 § -0) 4 0. Setting t n = Pn 1 l°g Pn is sufficient to guarantee that these conditions hold. 

E. Stochastic gradient descent for /i-likelihood 

Reparameterizing the random effect covariance matrix as X = o 2 A~ , the logarithm of the 
/i-likelihood for the normal hierarchical model can be written as 

M rii 

h(P, A, er 2 , it | y) = 

i= 1 j= 1 

where 

hij = - log <7 - ^{Vij - xfjP ~ zJjUi) 2 - log |cr 2 A -1 1 ^—uJAui. 

For fixed A, let /3 A and w A denote the values of /3 and u that maximize h; note that these values 
do note depend on o 2 . 

If we set cr 2 = 1, then the gradients of the summands with respect to P and u t are 

V f3 flij = (Uij x ijP ZijUi) Xjj , 

^Ufhij — (.Vij x ijP Z^jlli) Zij (l/rij)AUj. 

To apply a variant of the Robbins and Monro (1951) stochastic gradient descent method to find 
P A and w A , we start with randomly chosen initial values /3° and u°. At each iteration, we 
perform N steps, processing all observations in random order. In the ttli step, when we process 
observation ij, we choose a learning rate at and perform the updates 

P 4+1 = P* + at ('Vij ~ xJjP * - zjjul) Xij, 

u i +1 = u i + oit{(yij - xJjP 1 - zjjul) z^, —(l/n, ; )Aw*}; 

for i’ 7 ^ i, we retain the value of the random effect vector u\,. Following Darken and Moody 
(1991), it is common to take the step size as 


at = a:o(l + at) 1 
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for appropriately chosen values «o and a. Following the recommendations of Y. LeCun and 
Muller (1998) and Bottou (2012), we use different learning rates for the fixed effect vector and 
for each random effect vector. For the ith random effect, vector, we set a = aodl-AY 1 ll~ 1 / n i)- 
For the fixed effects, we set a = ag/N. 

In our simulations, we set «o = 0.1/iV for the fixed effect vector and ao = 0.1 /rii for the ith 
random effect vector works well. We perform 30 iterations for fitting a linear model, and 100 
iterations for fitting a logistic model. 


F. Performance in linear regression simulations 

We perform a simulation analogous to the one described in Section 7, but with a hierarchical 
linear regression model instead of a hierarchical logistic model. 

As in Section 7, we simulate N samples drawn from M groups, with M — 1000 and N ranging 
from 100 to 100000. We set the dimensions of the fixed and random effect vectors to p = q = 5. 
For each choice of N, we perform 100 replicates. 

For each replicate, we generate the p-dimensional fixed effect, /?, and q x q random effect 
covariance matrix, S as before, but with a different scaling for the fixed effect. Specifically, for 
k = 1 ,... ,p, we draw 

betak independently from a t distribution with 4 degrees of freedom. We draw coefficient matrix 
S from an inverse Wishart distribution with shape I and 2 q degrees of freedom. 

We draw the group specific sample sizes n,; (i = 1,..., M), random effects «;(?' = 1,..., M), 
and predictor vectors Xij, z.ij (i = 1,..., M; j = 1,..., rij) as in Section 7. To generate response 
value yij, we draw from a normal distribution with mean /i,; y = xJj/3 + zj 3 u t and variance rf> = 1. 

We compute moment-based estimates of the population parameters /3 and S, along with 
approximate empirical Bayes estimates Ut (i = 1,..., M) using the procedure described in Sec¬ 
tion 3.4. We compare the following methods: 

(a) mhglm, the proposed moment-based estimation procedure. To compute the moment-based 
estimates, we use two-step estimators with semi-weighted initial step and S 0 set to the 
identity matrix, after standardizing the predictors. The procedure is implemented in the 
R programming language. 

(b) Imer , maximum likelihood, using a gradient-free optimization procedure applied to the 
profiled likelihood, implemented in C++ and R by the lme4 R package (Bates et al., 
2013). 

(c) sgd, which uses stochastic gradient descent to maximize a regularized version of the h- 
likelihood. We implemented the compute-intensive inner loop in C, and the outer loop 
in R. 

(d) Imer split, a data-splitting estimation procedure, which splits the data set into 10 subsets, 
computes separate estimates for each using Imer, and then combines the estimates by 
averaging them. Implemented in R. 

(e) Ime, maximum likelihood, using a combination of Expectation-Maximization and New¬ 
ton Raphson iteration, implemented in the C and R programming languages by the nlme 
package (Pinheiro et ah, 2014). 

As in Section 3.4, we report serial computation time for each procedure, and we do not include 
cross-validation time for the tuning parameter selection for the sgd method. 

We use the same fixed effect, random effect covariance, and random effect loss as in Section 7. 



Log 10 Seconds Log 10 Mean Loss 
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Fig. 1 . Performance for the hierarchical linear model. Circle radii indicate one standard error along y- 
(absent when smaller than line width). 
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For prediction loss, we use the squared error loss 

M m 

n y y ' <i> (pij — Pij) 

»=i j =i 

where pij = xF/3 + zT-u, and /i .= xF0 + zJjUi. 

Fig. 1 shows the mean loss, averaged over all replicates, with circle radii indicating standard 
errors along the vertical axes (absent when less than the visible line width). For small sample 
sizes, the methods based on maximum likelihood ( Ime and Imer ) perform slightly better than 
the proposed method ( mhglm ), but for large sample sizes, these methods all have similar loss 
performances. The Ime, Imer, Imer. split, and mhglm methods all appear to give consistent 
estimates, with the Imer.split being less statistically efficient than that other three methods. 
The sgd method does not see a noticeable improvement in loss performance over the range of 
sample sizes considered. 

The lower-left panel of Fig. 1 shows the sequential computation times for all methods. In 
terms of computation time, for most values of N in the simulation, the proposed method performs 
better than the likelihood-based procedures. However, at N = 10 4 75 and N = 10 5 , the proposed 
method has the same average running time as the Ime method. 

In terms of computation time, the sgd procedure is clearly the fastest, followed by the pro¬ 
posed moment-based approach, and then the likelihood-based methods. In terms of statistical 
efficiency, exact maximum likelihood procedures and the proposed moment-based procedures 
perform best, followed by Imer.split and the sgd method. The sgd method is much faster than 
the other methods under consideration, but the computational gains are paid for with reduced 
statistical efficiency. The mhglm is not as fast as the sgd method, but in terms of statistical 
efficiency, it is competitive with the exact maximum likelihood methods. 
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